ggplot2 is a data visualization package written by Hadley Wickham that uses the “grammar of graphics.” The grammar of graphics provides a consistent way to describe the components of a graph, allowing us to move beyond specific types of plots (e.g., boxplots and scatterplots) to the individual elements that compose a plot. As the name implies, the grammar of graphics is a language we can use to describe and build visualizations.
Today, we’ll look at the basic syntax of ggplot2 graphics, as well as some other tidyverse tools, using simulated regression data.
library(ggplot2)
library(dplyr)
If you are so inclined, all of the code for this document is on my GitHub page.
First, we’ll define a function to generate data.
generate_data <- function(n, b0, b1, b2, bint, seed) {
  set.seed(seed)
  # Continuous predictor and two categorical covariates
  x1 <- rnorm(n = n, mean = 0, sd = 1)
  x2 <- sample(factor(c("Male", "Female")), size = n, replace = TRUE,
    prob = c(0.4, 0.6))
  x3 <- sample(factor(c("Caucasian", "Hispanic", "African American")), size = n,
    replace = TRUE, prob = c(0.5, 0.2, 0.3))
  # Residual error with variance 10
  e <- rnorm(n = n, mean = 0, sd = sqrt(10))
  # as.numeric() on a factor returns the level codes (Female = 1, Male = 2)
  y <- b0 + (b1 * x1) + (b2 * as.numeric(x2)) + (bint * x1 * as.numeric(x2)) + e
  # tibble() replaces the deprecated data_frame()
  tibble(outcome = y, predictor = x1, gender = x2, race = x3)
}
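One detail in this function is easy to miss: as.numeric() applied to a factor returns the integer level codes, and R sorts factor levels alphabetically by default. A small illustration (the object name g is used only for this example) shows how gender enters the model:

```r
# Levels are sorted alphabetically, so "Female" is level 1 and "Male" is level 2
g <- factor(c("Male", "Female"))
levels(g)
#> [1] "Female" "Male"

# as.numeric() returns the level codes, not 0/1 dummy values
as.numeric(g)
#> [1] 2 1
```

So b2 and bint are applied with gender coded as 1 (Female) and 2 (Male), rather than as a 0/1 indicator.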
And then, we will use that function to generate a sample for our example.
mlm_data <- generate_data(n = 1000, b0 = 3, b1 = 5, b2 = 3, bint = 4,
  seed = 9416)
mlm_data
#> # A tibble: 1,000 × 4
#> outcome predictor gender race
#> <dbl> <dbl> <fctr> <fctr>
#> 1 -1.1238447 -0.7430094 Male Caucasian
#> 2 10.8342589 0.2046086 Female Hispanic
#> 3 -1.7894618 -0.7236642 Male Caucasian
#> 4 3.6140835 0.3188742 Female Hispanic
#> 5 6.2861291 0.1414323 Female Hispanic
#> 6 -8.8799150 -1.2760527 Male Caucasian
#> 7 0.3724540 -0.8999638 Male Caucasian
#> 8 33.7692404 1.5350762 Male Caucasian
#> 9 24.5912195 1.4025014 Male Caucasian
#> 10 -0.4575962 -0.6454665 Male African American
#> # ... with 990 more rows
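As a sanity check on simulated regression data, we can fit the generating model and see whether the estimates land near the values we supplied. The sketch below is a compact re-simulation of the same model, not a call to generate_data(), so its random draws will differ from mlm_data. The names sim, fit, and gender_code are hypothetical and exist only for this example.

```r
# Compact re-simulation of the model used above (assumed names: sim, fit, gender_code)
set.seed(9416)
n <- 1000
x1 <- rnorm(n)
# Gender coded as in generate_data(): 1 = Female (prob 0.6), 2 = Male (prob 0.4)
gender_code <- sample(c(1, 2), size = n, replace = TRUE, prob = c(0.6, 0.4))
e <- rnorm(n, mean = 0, sd = sqrt(10))

sim <- data.frame(
  outcome = 3 + 5 * x1 + 3 * gender_code + 4 * x1 * gender_code + e,
  predictor = x1,
  gender_code = gender_code
)

# predictor * gender_code expands to both main effects plus their interaction
fit <- lm(outcome ~ predictor * gender_code, data = sim)

# Estimates should be close to b0 = 3, b1 = 5, b2 = 3, bint = 4
round(coef(fit), 2)
```

With 1,000 observations the estimates will not match the true values exactly, but they should fall within a few standard errors of them.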
Because ggplot2 is built on the grammar of graphics, the code for almost all plots will follow the same format.
ggplot(data = <data>, mapping = aes(<mappings>)) +
  geom_<element>()
In this structure, data defines the data for the plot, mapping defines how the aesthetics are mapped to different variables, and the geom commands add elements to the plot. For example, using our simulated data, we can map the predictor to the x-axis, the outcome to the y-axis, and make a scatterplot.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point()
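Because a plot is assembled from components joined with +, it can also be stored and extended piece by piece. Here is a minimal sketch using a small made-up data frame (toy, base, and scatter are hypothetical names):

```r
library(ggplot2)

toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# ggplot() alone sets up the data and mappings but adds no layers
base <- ggplot(data = toy, mapping = aes(x = x, y = y))
length(base$layers)
#> [1] 0

# Adding a geom appends one layer to the plot object
scatter <- base + geom_point()
length(scatter$layers)
#> [1] 1

# Printing the object draws the plot
scatter
```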
We could also make a bar plot to show the number of respondents from each group.
ggplot(data = mlm_data, mapping = aes(x = race)) +
  geom_bar()
Or we could make a histogram to look at the distribution of our outcome variable.
ggplot(data = mlm_data, mapping = aes(x = outcome)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that for the bar plot and histogram we did not define an aesthetic for the y-axis. By default, ggplot2 calculates the count for each value (or, for a histogram, each bin) on the x-axis. For each geom, the help pages will tell you which aesthetics are required, and which other aesthetics can be specified if desired (e.g., ?geom_histogram).
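To see the counting that ggplot2 does for us, we can compare geom_bar() with counting by hand. This is a sketch on made-up data (toy, toy_counts, and the plot names are hypothetical); layer_data() returns the values ggplot2 computed for a layer. The last plot sets binwidth explicitly, which is what the stat_bin() message asks for. The value 1 is an arbitrary choice for this toy data.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data standing in for a categorical variable
toy <- data.frame(group = c("a", "b", "a", "c", "a", "b"))

# geom_bar() counts the rows for each x value
p_bar <- ggplot(data = toy, mapping = aes(x = group)) +
  geom_bar()
layer_data(p_bar)$count
#> [1] 3 2 1

# The same plot, built by counting first and drawing the totals with geom_col()
toy_counts <- count(toy, group)
p_col <- ggplot(data = toy_counts, mapping = aes(x = group, y = n)) +
  geom_col()

# For a histogram, binwidth replaces the default of 30 bins
toy_values <- data.frame(v = c(0.2, 0.4, 1.5, 2.5, 2.7))
p_hist <- ggplot(data = toy_values, mapping = aes(x = v)) +
  geom_histogram(binwidth = 1)
```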
Let’s go back to our scatterplot to look at how we can change the details to look more like what we want.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point()
We can change aspects of the geom itself by adding arguments to the geom call.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(color = "blue", size = 3, alpha = 0.3, shape = 15)
Here, we’ve made the dots square, bigger, blue, and slightly transparent. A full list of shapes is available here.
It is also possible to map these aesthetics to variables in the dataset, just like we did with the axes.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome,
    color = gender)) +
  geom_point()
Now each gender has its own color, and a legend is automatically generated. We can also mix aesthetics that are and are not mapped to variables.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome,
    color = gender)) +
  geom_point(shape = 15, alpha = 0.6, size = 3)
Here, color is still mapped to gender, but the shape, alpha, and size values are fixed and apply to the entire geom.
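The difference between setting a fixed value and mapping a variable is a common source of confusion, so here is a small sketch on made-up data (toy, p_set, and p_map are hypothetical names). A value placed inside aes() is treated as data, not as a literal color.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 4, 3))

# Setting: color is a fixed value passed outside aes()
p_set <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(color = "blue")

# Mapping: inside aes(), "blue" is treated as a one-level variable, so
# ggplot2 assigns a color from its default palette and adds a legend
p_map <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(mapping = aes(color = "blue"))

unique(layer_data(p_set)$colour)
#> [1] "blue"

# A palette color chosen by the scale, not "blue"
unique(layer_data(p_map)$colour)
```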
Often, we want to add additional elements to our plots. This is straightforward in ggplot2: we simply add another geom.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point() +
  geom_smooth(method = "lm")
By default, geom_smooth uses method = "gam" when the largest group has 1,000 or more observations (and method = "loess" otherwise), but we can choose a linear model by using method = "lm". Just like before, we can also map aesthetics to other variables.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome,
    color = gender)) +
  geom_point() +
  geom_smooth(method = "lm")
It’s also possible to map aesthetics to additional variables for only specific geoms.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point() +
  geom_smooth(mapping = aes(color = gender), method = "lm")
Notice that we’ve moved the color mapping to the geom_smooth call. This results in a different smoothed line for each group, but this is not extended to the points. Aesthetics that are defined in the top ggplot call are global and get applied to all geoms, whereas aesthetics defined within the geom are local and apply only to that specific geom.
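We can confirm the global versus local behavior by inspecting what each layer actually receives. This is a sketch on made-up data (toy, grp, and p are hypothetical names); the formula argument is spelled out only to avoid the informational message from geom_smooth.

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(
  x = rnorm(40),
  grp = rep(c("a", "b"), each = 20)
)
toy$y <- 2 * toy$x + rnorm(40)

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(mapping = aes(color = grp), method = "lm", formula = y ~ x)

# Layer 1 (points) never sees the local color mapping: a single color
length(unique(layer_data(p, 1)$colour))
#> [1] 1

# Layer 2 (smooth) is split by grp: two colors, one fitted line each
length(unique(layer_data(p, 2)$colour))
#> [1] 2
```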
This can also be applied to data. For example, we could plot only the points from the Hispanic group, but use the full data set to fit the smooth lines.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = filter(mlm_data, race == "Hispanic")) +
  geom_smooth(mapping = aes(color = gender), method = "lm")
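A variation on this idea, sketched below on made-up data, is to pass a function as the layer's data argument. ggplot2 calls the function with the plot's data, so we don't have to repeat the name of the data set. The names toy, grp, and p are hypothetical.

```r
library(ggplot2)
library(dplyr)

toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1.2, 1.9, 3.2, 3.8, 5.1, 6.3),
  grp = c("a", "b", "a", "b", "a", "b")
)

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  # The function receives the plot's data and returns the subset to draw
  geom_point(data = function(d) filter(d, grp == "a")) +
  # The smooth still uses all six rows
  geom_smooth(method = "lm", formula = y ~ x)

# Only the three "a" rows reach the points layer
nrow(layer_data(p, 1))
#> [1] 3
```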
Sometimes it can be beneficial to look at groups separately, rather than together in a single plot. This can be accomplished with facetting.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender)
It may also be helpful to plot the full data within each facet and just highlight the specific group. This can be accomplished by using two calls to geom_point, and removing the facetting variable in the first.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender)
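The reason this works is that a layer whose data lacks the faceting variable cannot be assigned to a single panel, so its rows are repeated in every panel. A small sketch on made-up data (toy, grp, and p are hypothetical names) makes the repetition visible through the PANEL column of layer_data():

```r
library(ggplot2)

toy <- data.frame(
  x = 1:6,
  y = c(2, 3, 5, 4, 6, 8),
  grp = rep(c("a", "b"), times = 3)
)

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  # Background layer: grp is dropped, so all rows appear in every panel
  geom_point(data = toy[, c("x", "y")], color = "grey70") +
  # Foreground layer: grp is present, so each row lands only in its own panel
  geom_point(mapping = aes(color = grp)) +
  facet_wrap(~ grp)

# Rows per panel in the background layer (all 6 rows, twice)
table(layer_data(p, 1)$PANEL)

# Rows per panel in the foreground layer (3 rows each)
table(layer_data(p, 2)$PANEL)
```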
So far we’ve looked at how we can use geoms and aesthetics to create the elements of a plot. But ggplot2 also provides methods for formatting the plots to look exactly how you want. For example, we can add titles and change scales.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender) +
  labs(
    x = "An important predictor",
    y = "Representative outcome",
    title = "An important finding",
    subtitle = "More details about this very important thing"
  ) +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  scale_y_continuous(breaks = seq(-100, 100, 10))
We can also define the colors that get used.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender) +
  labs(
    x = "An important predictor",
    y = "Representative outcome",
    title = "An important finding",
    subtitle = "More details about this very important thing"
  ) +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  scale_y_continuous(breaks = seq(-100, 100, 10)) +
  scale_color_manual(values = c("red", "blue"))
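An unnamed vector of colors is matched to the factor levels in order, so here "red" goes to Female and "blue" to Male. If we would rather not depend on the level order, a named vector ties each color to a level explicitly. This is a sketch on made-up data (toy and p are hypothetical names):

```r
library(ggplot2)

toy <- data.frame(
  x = 1:4,
  y = c(1, 3, 2, 4),
  gender = factor(c("Male", "Female", "Male", "Male"))
)

# The names, not the positions, decide which level gets which color
p <- ggplot(data = toy, mapping = aes(x = x, y = y, color = gender)) +
  geom_point() +
  scale_color_manual(values = c(Male = "blue", Female = "red"))

# Three Male rows drawn in blue, one Female row in red
table(layer_data(p)$colour)
```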
Nearly anything you want to change about a plot’s appearance can be altered with scales or themes.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender) +
  labs(
    x = "An important predictor",
    y = "Representative outcome",
    title = "An important finding",
    subtitle = "More details about this very important thing"
  ) +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  scale_y_continuous(breaks = seq(-100, 100, 10)) +
  scale_color_manual(values = c("red", "blue")) +
  theme_bw() +
  theme(
    legend.position = "bottom",
    panel.grid.minor.x = element_blank(),
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(face = "italic"),
    axis.title = element_text(size = 8)
  )
To format legends, we can use the guides function.
ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm", color = "gold") +
  facet_wrap(~ gender) +
  labs(
    x = "An important predictor",
    y = "Representative outcome",
    title = "An important finding",
    subtitle = "More details about this very important thing"
  ) +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  scale_y_continuous(breaks = seq(-100, 100, 10)) +
  scale_color_manual(values = c("red", "blue")) +
  theme_bw() +
  theme(
    legend.position = "bottom",
    panel.grid.minor.x = element_blank(),
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic", size = 12),
    axis.title = element_text(size = 10),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8)
  ) +
  guides(
    color = guide_legend(title = "Gender", title.position = "top",
      title.hjust = 0.5, label.position = "bottom", label.hjust = 0.5,
      keywidth = unit(1, "cm"), override.aes = list(alpha = 1, size = 3))
  )
As can be seen from this last plot, the downside to ggplot2 is that the code to create a plot can become quite verbose. However, this is because we are able to alter almost any aspect of the plot.
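One way to tame some of that verbosity, sketched here under the assumption that several plots share the same styling, is to collect the repeated components in a list and add the list in one step. The names my_style, toy, and p are hypothetical.

```r
library(ggplot2)

# Hypothetical shared styling: a list of components can be added with a single +
my_style <- list(
  theme_bw(),
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  ),
  labs(title = "An important finding")
)

toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  my_style

# The theme settings from the list are now part of the plot
p$theme$legend.position
#> [1] "bottom"
```

If the goal is only to change the default theme for every plot in a session, theme_set(theme_bw()) is another option.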
So far, we’ve only talked in detail about a few commands that are useful for creating plots typical of a regression. However, there are many more geoms, scales, and theme options to create almost any type of graphic you can think of.
For example, we can plot how students move through different levels of an adaptive assessment.
Or we can look at the probability of a respondent providing the correct response to an item, given their ability, in different types of psychometric models.
We can also use heat maps to compare the amount of error present for combinations of variables.
Alternatively, we could do more fun things like look at which US cities have the most breweries.
Or look at the distribution of brewery ratings for the surrounding states.
Finally, there are many extensions to ggplot2 (like the gganimate package from David Robinson), which we can use to plot the probability of KU winning a basketball game over time.